What makes a painting emotional? Finding the most effective handcrafted features for emotion classification of paintings

This notebook is a comprehensive summary of the findings of this project. It is organized into a setup section (Part 0) and four parts.

0. Notebook setup
0.1 Load Libraries
0.2 Preprocess dataset
0.3 Examples from the dataset and ground-truth labels.

1. Introduce features
1.1 List of handcrafted features
1.2 Visualization of handcrafted features
1.3 Visualization of learned features
1.4 Predictive ability of low-level and high-level features
2. Rank handcrafted features in terms of predictive power
2.1 Rank handcrafted features using SHAP values
2.2 Rank handcrafted features using single-feature SVM
2.3 Rank handcrafted features using feature importance from Decision Tree
2.4 Combine all rankings
2.5 Rank handcrafted feature subgroups
2.6 Emotion-class-specific ranking of feature subgroups
3. Correlate features with features learned by a deep model.
3.1 Correlation between single handcrafted features and learned features
3.2 Correlation between handcrafted feature subgroups and learned features
4. Improve classifier accuracy by combining handcrafted features and learned features

Part 0 Notebook setup

Part 0.1 - Load libraries

Part 0.2 - Preprocess dataset

We use a 30k subset of the ArtEmis dataset (80k). This subset contains paintings for which a single emotion class holds the majority vote, which allows us to predict discrete emotion labels instead of emotion distributions. A few examples from this subset are shown below:
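The majority-vote filtering can be sketched as follows. This is a minimal illustration on a hypothetical annotation table; the column names (`painting`, `emotion`) are ours, not the actual ArtEmis schema, and pandas is assumed:

```python
import pandas as pd

# Hypothetical annotation table: one row per (painting, annotator) emotion vote.
votes = pd.DataFrame({
    "painting": ["a", "a", "a", "b", "b", "b"],
    "emotion":  ["awe", "awe", "fear", "sadness", "awe", "fear"],
})

def majority_label(s):
    """Return the emotion holding a strict majority of votes, else None."""
    counts = s.value_counts()
    if counts.iloc[0] > len(s) / 2:
        return counts.index[0]
    return None  # tie / no majority -> painting is dropped

# Keep only paintings where a single emotion class wins the majority vote.
labels = votes.groupby("painting")["emotion"].apply(majority_label).dropna()
```

Here painting "a" keeps the label "awe" (2 of 3 votes), while "b" is dropped because no emotion has a strict majority.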

Part 0.3 - Examples from the dataset and ground-truth labels.

Part 1 - Introduce features

Part 1.1 - Handcrafted vs Learned features, Low-level vs High-level features.

First of all, there are two types of features: handcrafted features, which are proposed by humans based on knowledge and intuition, and learned features, which are learned automatically by a deep model.

Within handcrafted features, there are low-level and high-level features, as defined in literature.

Low-level features concern color, texture, lines, shapes, contrast, and other semantic-free elements.

High-level features are features that capture the semantic meaning and the global composition of the painting, such as bounding boxes, number of faces, genre, style, and artist.

The features from the two groups are listed below. Within the low-level and high-level categories, features are grouped into subgroups; for example, one feature subgroup is "hue".

These features are a curation of features proposed in the literature.

All low-level features except the symmetry features, plus the "faces and skin" high-level feature, come from the following paper; we adapted the code from a third-party implementation: Machajdik J, Hanbury A. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia, 2010: 83-92.

Symmetry features (Bilateral, Rotational, and Radial) come from the following paper. Zhao, S., Gao, Y., Jiang, X., Yao, H., Chua, T. S., & Sun, X. (2014, November). Exploring principles-of-art features for image emotion recognition. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 47-56).

High-level features:

Part 1.2 - Visualize handcrafted features

The following shows a 2D approximation of our handcrafted features, with all handcrafted features concatenated together.

The plot on the left shows points with positive vs negative labels, and the plot on the right shows points with multi-class labels. Note that the points do not refer to the same examples, as we subsampled to create a balanced dataset.

The visualization method we use is t-SNE, a non-linear dimensionality reduction method trained to preserve the local clustering of the points.
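A minimal sketch of the projection, using random data in place of the concatenated handcrafted feature matrix (scikit-learn assumed):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # stand-in for the handcrafted feature matrix

# t-SNE maps to 2D while trying to preserve each point's local neighbourhood.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
```

The 2D embedding `emb` is what gets scatter-plotted, colored by the emotion labels.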

Part 1.3 - Visualize learned features

We obtained learned features by training a ResNet-34 model from scratch, without pretrained weights. The training task was multi-class classification, and the training set was the full 80k ArtEmis dataset. The model reached a test-set accuracy of 0.435.

Remarks: the top 5 PCA components capture 99% of the variance of the learned features (100D), whereas the top 50 PCA components capture only 72% of the variance of the handcrafted features (126D).
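The variance comparison can be reproduced in miniature with `PCA.explained_variance_ratio_`; the low-rank matrix below is a synthetic stand-in for the learned features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Nearly rank-5 data: variance concentrated in a few directions, as with
# the learned features described above.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 100)) \
    + 0.01 * rng.normal(size=(500, 100))

pca = PCA().fit(X)
top5 = pca.explained_variance_ratio_[:5].sum()  # close to 1.0 here
```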

Part 1.4 - Predictive ability of low-level and high-level features

Let's run a simple logistic regression model to see the performance of these features.
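The baseline can be sketched as follows, with synthetic data standing in for the handcrafted features and emotion labels (scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the handcrafted features and 4 emotion classes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # test accuracy of the feature set
```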

Remarks:

Part 2 - Rank handcrafted features by their predictive power

Let's find out which handcrafted features have the highest predictive power.

We use three methods to do this.

We combine all rankings to get the most important handcrafted features.

Part 2.1 - Rank features with SHAP values

  1. We train an explainable model. We use logistic regression (closely related to a linear-kernel SVM) because it is faster. We reuse the logistic regression model trained in Part 1.4.
  2. We will find the average marginal contribution of the features to the final predictions of this model, which are called SHAP values.
  3. We will rank the features according to their SHAP values.

Visualize feature ranking. The top plot shows the 10 features with the largest contributions to the final prediction. The bottom plot ranks the feature subgroups by the sum of the SHAP values of all features in each subgroup.

To understand what SHAP values mean, we can visualize how each feature affects the prediction of a single example. If the SHAP value of a feature is negative, it moves the prediction line to the left and the confidence that the example belongs to the corresponding emotion class decreases, and vice versa.
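For a linear model with (approximately) independent features, SHAP values have a closed form: the contribution of feature i on example x is w_i(x_i − E[x_i]). A toy sketch of the ranking step, with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
w = np.array([2.0, -1.0, 0.5, 0.0])  # stand-in weights of a linear model

# Exact SHAP values for a linear model with independent features:
# phi_i(x) = w_i * (x_i - E[x_i]). Negative values push the prediction down.
shap_values = (X - X.mean(axis=0)) * w

# Global ranking: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]  # feature 0 ranks first, feature 3 last
```

In practice one would use the `shap` library's explainers, but the formula above is the quantity being computed for a linear model.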

Part 2.2 - Rank features with single feature SVMs

Let's use a second method to cross-validate our SHAP feature ranking.

We use single handcrafted features to predict multi-class labels. We use their accuracy on the test set to rank their predictive power.
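A minimal sketch of the per-feature SVM ranking on synthetic data (scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One SVM per single feature; its test accuracy proxies that feature's
# predictive power on its own.
accs = [SVC().fit(X_tr[:, [i]], y_tr).score(X_te[:, [i]], y_te)
        for i in range(X.shape[1])]
ranking = np.argsort(accs)[::-1]  # best single feature first
```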

Part 2.3 - Rank features using feature importance from Decision Tree Classifier

As we can see, SVM and SHAP give noticeably different rankings. We now use a third method to cross-validate: we train a decision tree classifier and rank the features by their feature importance values.
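The third ranking uses the tree's impurity-based `feature_importances_`; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
importances = tree.feature_importances_  # impurity-based, sums to 1
ranking = np.argsort(importances)[::-1]
```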

Part 2.4 - Combine all rankings

Note how toward the bottom the lines are less tangled -- powerful features are considered powerful by most methods, while less powerful features are more easily affected by method-specific differences. Out of 177 handcrafted features, these 5 have a significantly larger influence on a painting's emotion:

  1. GLCM homogeneity (saturation)
  2. Colorfulness
  3. genre_is_landscape
  4. Amount of black
  5. bbox PCA 0
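One simple way to combine the three rankings (the exact aggregation scheme is not spelled out above, so this is just one reasonable choice) is to average each feature's rank across methods:

```python
import numpy as np

# Rank arrays: entry i is feature i's rank under each method (0 = best).
shap_rank = np.array([0, 2, 1, 3, 4])  # made-up example ranks
svm_rank  = np.array([1, 0, 2, 4, 3])
tree_rank = np.array([0, 1, 3, 2, 4])

mean_rank = np.mean([shap_rank, svm_rank, tree_rank], axis=0)
combined = np.argsort(mean_rank)  # lowest mean rank = most important
```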

We found the most predictive handcrafted features! Let's see whether they affect the emotions of a painting positively or negatively. If there is a positive correlation between a feature's value and its Shapley value for the positive class, then higher feature values correlate with positive emotion.

Now let's see the per-class emotion contribution of each feature.

Part 2.5 - Ranking feature groups

We perform a similar analysis on feature groups instead of single features. This is helpful for observing the aggregate effect of feature groups, especially for high-dimensional features such as artstyle and genre, which are one-hot encodings.

The feature group ranking is artstyle > hue > texture = genre > bbox > rule of thirds > saturation and brightness > radial symmetry > rotational symmetry = bilateral symmetry > faces and skin > lines.

Remarks:

Part 2.6 - Emotion-class-specific ranking of the feature groups

We have ranked feature groups in the previous section. Now we dive deeper and see which features are more relevant to the classification of a specific class. We do this through ranking features using one-class-versus-all SVM classification accuracy.
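The one-class-versus-all setup can be sketched as below, on synthetic data; each class gets its own binary "this class vs rest" SVM:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One binary "class c vs rest" SVM per class; the per-class accuracies show
# which classes a given feature set separates well.
per_class_acc = {int(c): SVC().fit(X_tr, y_tr == c).score(X_te, y_te == c)
                 for c in np.unique(y)}
```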

Remarks:

Below, we carry out the same one-class-versus-all task using learned features. The result suggests that the handcrafted features are not biased; rather, some classes are intrinsically harder to predict and harder for humans to label accurately.

The emotion classes "amusement", "disgust", and "something else" are harder to predict. This agrees with our intuition -- those emotions are more nuanced and subjective.

Part 3 - Correlation between handcrafted features and features learned by a deep model

So far, we found which handcrafted features are more powerful. We used three methods to do this: SHAP, single-feature SVM, and decision tree.

Now, we investigate what a ResNet-34 model has learned when trained from scratch. The model used randomly initialized weights and was early-stopped using cross-validation.

We use handcrafted features as landmarks to investigate what it has learned. Does it learn more low-level or more high-level features?

We carry out two methods:

  1. Train Multilayer Perceptron, SVM, and Decision Tree with combined learned and handcrafted features, and compare the relative improvement in prediction accuracy in all three classifiers. This gives us a rough idea for which handcrafted features provide new and useful information.

  2. Correlate single handcrafted features with learned features using linear regression, then rank handcrafted features according to the correlation coefficient of the learned linear mappings. This gives us a rough approximation for what information is encoded in the learned features.
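The second method can be sketched as below: regress a single handcrafted feature on the learned features and score the fit with a Pearson correlation (synthetic stand-in data, scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
learned = rng.normal(size=(300, 100))  # stand-in 100-D learned features
# A handcrafted feature that is partially encoded in the learned features.
handcrafted = learned[:, 0] + 0.5 * rng.normal(size=300)

# Regress the single handcrafted feature on the learned features; the Pearson
# correlation between the fit and the feature scores how well it is encoded.
pred = LinearRegression().fit(learned, handcrafted).predict(learned)
r = np.corrcoef(pred, handcrafted)[0, 1]
```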

Part 3.1 Correlation between handcrafted features and learned features

For feature groups, we can no longer learn just a single linear mapping, since each feature group forms an N-dimensional vector. For this reason, we use Canonical Correlation Analysis to learn N linear mappings that are uncorrelated with each other, and we report the total variance explained by the N linear mappings.

The first column is the Pearson correlation coefficient between the handcrafted feature and the linear mapping of the learned features. The second column is the canonical communality coefficient of the feature; it represents how much variance of the original feature was captured across all of the canonical functions, and it indicates how useful the observed variable was for the entire analysis. The third column is the improvement in prediction accuracy after concatenating the learned features with the single handcrafted feature; we average the accuracy improvement across three separate models: SVM, Decision Tree, and Multilayer Perceptron.

Note that the learned features are 100-dimensional. We are modeling 1-dimensional features using 100-dimensional features, so it is easy to obtain a high correlation and hard to obtain a weak one. If a feature correlates only weakly with the learned features, this suggests it mostly models noise when considered on its own.

Part 3.2 Correlation between handcrafted feature groups and learned features

Let's analyse feature groups as a whole and see how much variance they share with the learned features.

Interpretation:

The above results suggest the importance of low-level features in the task of emotion classification: the learned features correlate more with low-level features than with high-level features.

However, it is possible that all classification tasks depend on low-level features, not just emotion classification. It is also possible that low-level features are intrinsically encoded in the pixel values and simply carried through by convnet models. To verify that this is not the case, we look at the correlations between the handcrafted features and image representations from AlexNet and VGG16: these have smaller correlation coefficients with the handcrafted features than the representations from our ResNet model do. This suggests that our ResNet model, trained on emotion classification of paintings, retained more low-level information throughout.

Part 4 - Improve model accuracy with handcrafted features

We concatenate feature combinations and train a multilayer perceptron with 3 hidden layers, batch normalization, and dropout (0.2). We also run an SVM on the same datasets as a sanity check.
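The concatenation experiment can be sketched as below. Note this uses scikit-learn's `MLPClassifier`, which supports neither batch normalization nor dropout, so it only illustrates the feature-concatenation idea, not the exact architecture; the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-ins for learned and handcrafted features over the same paintings.
X_learned, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                                   n_classes=4, n_clusters_per_class=1,
                                   random_state=0)
rng = np.random.default_rng(0)
X_hand = np.hstack([X_learned[:, :4] + rng.normal(size=(500, 4)),  # shared info
                    rng.normal(size=(500, 6))])                    # extra features

X = np.hstack([X_learned, X_hand])  # concatenated representation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
```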

Remarks:

High-level features improve the accuracy of the learned representations, suggesting the ResNet model did not pick up all the relevant semantic information. This is not surprising, as the high-level features encode the artstyle, genre, and bounding boxes of people, which are very hard to learn from an 80k dataset. More importantly, the ResNet model was trained only on emotion prediction.

Low-level features do not improve the accuracy of the learned representations as much as high-level features do, suggesting that they offer less new information to the learned representations.

The best accuracy is obtained by concatenating the learned representation with the handcrafted features and AlexNet image representations. AlexNet is trained on ImageNet classification, and its image representations capture high-level semantic information.